
ACKNOWLEDGMENTS

Within the word count constraints, I would like to give quick but much-needed thanks to lecturers Daqing Chen and George Bamfo for their guidance and supervision, especially during tutorials, where they provided regular and bespoke feedback on this Big Data Analytics project.

INTRODUCTION

1. Business Understanding

The following report outlines an analysis of a dataset collated by LSBU and provided by Havering borough, covering car accident occurrences along with other relevant variables recorded at the time of each incident. Examples of the attributes provided for each accident include weather conditions, road conditions, police attendance and severity.

The dataset provided by the Department for Transport outlines accident information in the borough of Havering. The data comes as two Excel files: one containing the main data, in which most columns are coded with numbers, and another mapping each code to its label. Due to recent changes in the economy, councils such as Havering have been facing a budget crunch. For example, an article published by (Romford Recorder, 2023) on 5 August 2023 outlined how Havering Council rejected a petition from citizens requesting a lollipop lady outside two schools. By analysing the data provided to us by the Department for Transport, we can recommend safer alternatives to Havering Council, in turn allowing them to significantly improve citizen road safety.

An authority monitoring report published by (Havering Council, 2021-2022) outlined that 57% of pupils in Havering walk to school. Though this is part of the council's green initiative, the significantly high number of pupils walking to school poses a great risk to pedestrian safety. This business question is raised in section 1.2.

BUSINESS QUESTIONS

1) How is the severity of accidents impacted by weather conditions, road surface conditions and the light conditions?

2) How is pedestrian safety / the number of casualties affected by different road types, for example junction type (Junction_Detail), pedestrian crossing type (Pedestrian_Crossing-Physical_Facilities) and Road_Type?

3) How is the severity of accidents on a particular road class impacted by the speed limit in rural vs urban areas?


To gain a better idea of the geographical location of Havering, a plotted map of accidents is shown below (interactive).

In [6]:
import folium
from folium.plugins import HeatMap

# Create a base map
m = folium.Map(location=[filtered_df['Latitude'].mean(), filtered_df['Longitude'].mean()], zoom_start=10)

# Create a HeatMap layer to visualize accident density
heat_data = filtered_df[['Latitude', 'Longitude']].values.tolist()  # vectorised, avoids a slow iterrows loop
HeatMap(heat_data).add_to(m)

# Add markers for individual accidents (optional)
for index, row in filtered_df.iterrows():
    folium.Marker([row['Latitude'], row['Longitude']],
                  icon=folium.Icon(color='red', icon='info-sign')).add_to(m)

# Save the map as an HTML file
m.save('accidents_map.html')

# Display the map in Jupyter Notebook (optional)
m
Out[6]:

2. DATA UNDERSTANDING

Importing libraries

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
import scipy.stats as stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler

Defining width/length of outputs

In [8]:
pd.set_option('display.max_columns', None)  
pd.set_option('display.width', 1000)        
pd.set_option('display.max_colwidth', 50)   
pd.set_option('display.max_rows', 50)

Importing initial dataset

In [9]:
df = pd.read_csv('AccidentLondonBoroughs2223.csv')

Filtering by Havering borough

In [10]:
filtered_df = df[df['Local_Authority_Highway'] == 'E09000016']

An overview of the dataset:

In [11]:
filtered_df.head()
Out[11]:
Accident_Index Location_Easting_OSGR Location_Northing_OSGR Longitude Latitude Police_Force Accident_Severity Number_of_Vehicles Number_of_Casualties Date Day_of_Week Time Local_Authority_District Local_Authority_Highway 1st_Road_Class 1st_Road_Number Road_Type Speed_limit Junction_Detail Junction_Control 2nd_Road_Class 2nd_Road_Number Pedestrian_Crossing-Human_Control Pedestrian_Crossing-Physical_Facilities Light_Conditions Weather_Conditions Road_Surface_Conditions Special_Conditions_at_Site Carriageway_Hazards Urban_or_Rural_Area Did_Police_Officer_Attend_Scene_of_Accident LSOA_of_Accident_Location
7803 200501JI40887 550080 189210 0.164701 51.581719 1 3 2 1 09/12/2005 6 16:00 15 E09000016 3 12 3 30 6 2 5 0 0 5 1 1 2 0 0 1 2 E01002329
7855 200501KD00005 551030 189060 0.178337 51.580117 1 3 1 1 13/01/2005 5 23:37 15 E09000016 3 125 1 30 1 4 3 118 0 0 4 1 2 0 0 1 1 E01002243
7856 200501KD00007 554800 188530 0.232471 51.574333 1 3 1 1 22/01/2005 7 05:54 15 E09000016 6 0 6 30 0 -1 -1 0 0 0 4 1 2 0 0 1 1 E01002271
7857 200501KD00009 552150 183990 0.192297 51.534262 1 2 1 1 22/01/2005 7 09:14 15 E09000016 6 0 6 30 3 4 6 0 0 0 1 1 1 0 0 1 1 E01002372
7858 200501KD00010 555900 191850 0.249805 51.603859 1 3 3 5 16/01/2005 1 00:38 15 E09000016 3 12 3 50 0 -1 -1 0 0 0 4 1 2 0 0 2 1 E01002296

The code below outputs the number of rows and columns (attributes) for the dataset filtered to Havering borough. There are 4023 rows and 32 columns/attributes.

2.1 NUMBER OF INSTANCES AND CURRENT DATATYPES

In [12]:
num_of_instances = filtered_df.shape[0]
print("Number of rows =", num_of_instances)

num_of_attributes = filtered_df.shape[1]
print("Number of columns/attributes =", num_of_attributes)
Number of rows = 4023
Number of columns/attributes = 32

The snippet below outlines the current datatypes. It can be observed that the coded columns have not yet been converted from numeric codes to categorical labels; this is expected given the two Excel sheets provided (main data and mapping).

In [13]:
data_types = filtered_df.dtypes

print(data_types)
Accident_Index                                  object
Location_Easting_OSGR                            int64
Location_Northing_OSGR                           int64
Longitude                                      float64
Latitude                                       float64
Police_Force                                     int64
Accident_Severity                                int64
Number_of_Vehicles                               int64
Number_of_Casualties                             int64
Date                                            object
Day_of_Week                                      int64
Time                                            object
Local_Authority_District                         int64
Local_Authority_Highway                         object
1st_Road_Class                                   int64
1st_Road_Number                                  int64
Road_Type                                        int64
Speed_limit                                      int64
Junction_Detail                                  int64
Junction_Control                                 int64
2nd_Road_Class                                   int64
2nd_Road_Number                                  int64
Pedestrian_Crossing-Human_Control                int64
Pedestrian_Crossing-Physical_Facilities          int64
Light_Conditions                                 int64
Weather_Conditions                               int64
Road_Surface_Conditions                          int64
Special_Conditions_at_Site                       int64
Carriageway_Hazards                              int64
Urban_or_Rural_Area                              int64
Did_Police_Officer_Attend_Scene_of_Accident      int64
LSOA_of_Accident_Location                       object
dtype: object

2.2 STATISTICS AND MISSING VALUES

The below outputs in-depth statistics of the initial dataset (before converting datatypes). These include the count, minimum, 25%, 50% (median), 75%, maximum, range, average, standard deviation, kurtosis, skewness and mode.

In [14]:
# Calculate mode for all columns (both numeric and categorical)
mode_df = filtered_df.mode().iloc[0]

# Filter out only numeric columns for other statistics
numeric_filtered_df = filtered_df.select_dtypes(include='number')

# Calculate min, max, and range for numeric columns
min_df = numeric_filtered_df.min()
max_df = numeric_filtered_df.max()
range_df = max_df - min_df

# Calculate other statistics for numeric columns
average_df = numeric_filtered_df.mean()
median_df = numeric_filtered_df.median()
std_dev_df = numeric_filtered_df.std()
kurtosis_df = numeric_filtered_df.kurtosis()
skewness_df = numeric_filtered_df.skew()

# Calculate the quantiles
quantiles_df = numeric_filtered_df.quantile([0.25, 0.5, 0.75])

# Creating a summary DataFrame
summary_df = pd.DataFrame({
    'Count': numeric_filtered_df.count(),
    'Minimum': min_df,
    '25%': quantiles_df.loc[0.25],
    '50% (Median)': quantiles_df.loc[0.5],  # same as 'Median': median_df
    '75%': quantiles_df.loc[0.75],
    'Maximum': max_df,
    'Range': range_df,
    'Average': average_df,
    'Standard Deviation': std_dev_df,
    'Kurtosis': kurtosis_df,
    'Skewness': skewness_df,
    'Mode': mode_df
})

# Display the summary DataFrame (variables as rows, statistics as columns)
display(summary_df)
Count Minimum 25% 50% (Median) 75% Maximum Range Average Standard Deviation Kurtosis Skewness Mode
1st_Road_Class 4023.0 1.000000 3.000000 4.000000 5.000000 6.000000 5.000000 4.136962 1.401634 -0.652785 -0.378333 3.0
1st_Road_Number 4023.0 0.000000 0.000000 12.000000 124.000000 1459.000000 1459.000000 120.072086 318.600348 10.775402 3.500065 0.0
2nd_Road_Class 4023.0 -1.000000 -1.000000 3.000000 6.000000 6.000000 7.000000 2.938852 3.006084 -1.610266 -0.359590 6.0
2nd_Road_Number 4023.0 0.000000 0.000000 0.000000 0.000000 1459.000000 1459.000000 49.775292 214.473027 31.679152 5.694489 0.0
Accident_Index NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 200501JI40887
Accident_Severity 4023.0 1.000000 3.000000 3.000000 3.000000 3.000000 2.000000 2.871738 0.357423 6.987306 -2.733735 3.0
Carriageway_Hazards 4023.0 0.000000 0.000000 0.000000 0.000000 7.000000 7.000000 0.058663 0.566809 125.884568 11.001825 0.0
Date NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 09/09/2006
Day_of_Week 4023.0 1.000000 3.000000 4.000000 6.000000 7.000000 6.000000 4.152374 1.940025 -1.200559 -0.066330 6.0
Did_Police_Officer_Attend_Scene_of_Accident 4023.0 1.000000 1.000000 1.000000 1.000000 3.000000 2.000000 1.210788 0.411560 0.400787 1.482759 1.0
Junction_Control 4023.0 -1.000000 -1.000000 4.000000 4.000000 4.000000 5.000000 2.041511 2.267430 -1.615848 -0.475183 4.0
Junction_Detail 4023.0 0.000000 0.000000 3.000000 3.000000 9.000000 9.000000 2.483719 2.447798 -0.156134 0.807195 3.0
LSOA_of_Accident_Location NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN E01002344
Latitude 4023.0 51.499662 51.556432 51.574174 51.590789 51.629950 0.130288 51.570070 0.027281 -0.276835 -0.616860 51.525146
Light_Conditions 4023.0 1.000000 1.000000 1.000000 4.000000 7.000000 6.000000 1.825255 1.399026 0.086025 1.245593 1.0
Local_Authority_District 4023.0 15.000000 15.000000 15.000000 15.000000 15.000000 0.000000 15.000000 0.000000 0.000000 0.000000 15.0
Local_Authority_Highway NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN E09000016
Location_Easting_OSGR 4023.0 548990.000000 551350.000000 553000.000000 554480.000000 561130.000000 12140.000000 553175.664927 2314.280843 -0.163872 0.672003 550630.0
Location_Northing_OSGR 4023.0 180220.000000 186555.000000 188460.000000 190330.000000 194700.000000 14480.000000 188006.726324 3038.903553 -0.256393 -0.619748 187420.0
Longitude 4023.0 0.148362 0.182388 0.205988 0.227358 0.322103 0.173741 0.208823 0.033457 -0.201826 0.676372 0.21244
Number_of_Casualties 4023.0 1.000000 1.000000 1.000000 1.000000 10.000000 9.000000 1.319911 0.723835 19.114208 3.504527 1.0
Number_of_Vehicles 4023.0 1.000000 1.000000 2.000000 2.000000 14.000000 13.000000 1.870743 0.696440 31.706350 2.681392 2.0
Pedestrian_Crossing-Human_Control 4023.0 0.000000 0.000000 0.000000 0.000000 2.000000 2.000000 0.004971 0.089059 404.257047 19.553205 0.0
Pedestrian_Crossing-Physical_Facilities 4023.0 0.000000 0.000000 0.000000 0.000000 8.000000 8.000000 0.694258 1.706158 3.889252 2.299839 0.0
Police_Force 4023.0 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.0
Road_Surface_Conditions 4023.0 1.000000 1.000000 1.000000 1.000000 5.000000 4.000000 1.232662 0.527284 9.946214 2.824944 1.0
Road_Type 4023.0 1.000000 3.000000 6.000000 6.000000 9.000000 8.000000 4.954263 1.844636 -0.172400 -0.854292 6.0
Special_Conditions_at_Site 4023.0 0.000000 0.000000 0.000000 0.000000 6.000000 6.000000 0.060403 0.505642 83.207777 8.985433 0.0
Speed_limit 4023.0 20.000000 30.000000 30.000000 30.000000 70.000000 50.000000 35.396470 12.026643 2.946248 2.091639 30.0
Time NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 13:30
Urban_or_Rural_Area 4023.0 1.000000 1.000000 1.000000 1.000000 2.000000 1.000000 1.188914 0.391489 0.528483 1.590038 1.0
Weather_Conditions 4023.0 1.000000 1.000000 1.000000 1.000000 9.000000 8.000000 1.457122 1.523588 15.731350 4.045185 1.0
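
Several of the statistics assembled column by column above can be cross-checked in a single call with pandas' `agg`. A minimal sketch on toy data (the column name is illustrative, not taken from the accident dataset):

```python
import pandas as pd

# Toy numeric column standing in for one of the accident attributes
toy = pd.DataFrame({"casualties": [1, 1, 2, 1, 5]})

# One agg call reproduces several of the statistics computed individually above
stats = toy["casualties"].agg(["count", "min", "max", "mean", "std", "median", "skew"])
print(stats)
```

This does not replace the summary table (range, kurtosis and mode still need separate calls), but it is a quick sanity check on the per-column figures.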

The following checks all cells for any null or -1 values and outputs the count of missing values per column. It can be seen that the columns Junction_Control, 2nd_Road_Class and LSOA_of_Accident_Location contain missing values. It is worth noting that this count somewhat under-represents the true extent of missing data, which will be explored further later.

In [15]:
filtered_df_copy = filtered_df.copy()

# Create a mask for -1 values
minus_one_mask = (filtered_df_copy == -1)

# Create a mask for NaN values
nan_mask = filtered_df_copy.isnull()

# Combine the masks
combined_mask = minus_one_mask | nan_mask

# Count the number of True values in each column
error_counts = combined_mask.sum()

# Filter to display counts where the number of errors is more than 0
error_counts_more_than_zero = error_counts[error_counts > 0]

print(error_counts_more_than_zero)
Junction_Control             1356
2nd_Road_Class               1356
LSOA_of_Accident_Location      11
dtype: int64
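
A common follow-up, not yet applied at this stage of the notebook, is to normalise the -1 sentinel to NaN so that pandas' own missing-value tools can handle it. A minimal sketch on toy data (column names mirror the dataset; values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy columns using -1 as a missing-value sentinel (values illustrative)
toy = pd.DataFrame({"Junction_Control": [4, -1, 2, -1],
                    "2nd_Road_Class": [3, -1, 6, 6]})

# Normalise the sentinel to NaN so isnull() sees it directly
cleaned = toy.replace(-1, np.nan)
missing = cleaned.isnull().sum()
print(missing)
```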

2.3 ANALYSIS AFTER CONVERTING DATATYPES (AS ADVISED BY DAQING CHEN) - ALSO PART OF PRE-PROCESSING

The code below imports the Road-Accident-Safety-Data-Guide.xls Excel file, which contains the mapping/coding for the categorical columns in the initial AccidentLondonBoroughs2223.csv file. It then maps the coded values to their labels and converts the data type, applying this to each column by taking as parameters the column name and its respective sheet (in Road-Accident-Safety-Data-Guide.xls) containing the codes and labels.

In [ ]:
data_guide_path = 'Road-Accident-Safety-Data-Guide.xls'
data_guide = pd.read_excel(data_guide_path, sheet_name=None)

# Function to apply the mapping from the data guide to the dataframe
def apply_mapping(df, column, sheet_name):
    # Choose the correct code column ('code' or 'Code')
    code_column = 'code' if 'code' in data_guide[sheet_name].columns else 'Code'

    # Choose the correct label column ('label' or 'Label')
    label_column = 'label' if 'label' in data_guide[sheet_name].columns else 'Label'

    # Create a mapping dictionary from the 'code'/'Code' to the 'label'/'Label'
    mapping_dict = pd.Series(data_guide[sheet_name][label_column].values, index=data_guide[sheet_name][code_column].astype(str)).to_dict()

    # Apply the mapping to the dataframe column
    df[column] = df[column].astype(str).map(mapping_dict).astype('category')


# Apply the mapping for each specified column
apply_mapping(filtered_df, 'Police_Force', 'Police Force')
apply_mapping(filtered_df, 'Accident_Severity', 'Accident Severity')
apply_mapping(filtered_df, 'Day_of_Week', 'Day of Week')
apply_mapping(filtered_df, 'Local_Authority_District', 'Local Authority (District)')
apply_mapping(filtered_df, 'Local_Authority_Highway', 'Local Authority (Highway)')
apply_mapping(filtered_df, '1st_Road_Class', '1st Road Class')
apply_mapping(filtered_df, 'Road_Type', 'Road Type')
apply_mapping(filtered_df, 'Junction_Detail', 'Junction Detail')
apply_mapping(filtered_df, 'Junction_Control', 'Junction Control')
apply_mapping(filtered_df, '2nd_Road_Class', '2nd Road Class')
apply_mapping(filtered_df, 'Pedestrian_Crossing-Human_Control', 'Ped Cross - Human')
apply_mapping(filtered_df, 'Pedestrian_Crossing-Physical_Facilities', 'Ped Cross - Physical')
apply_mapping(filtered_df, 'Light_Conditions', 'Light Conditions')
apply_mapping(filtered_df, 'Weather_Conditions', 'Weather')
apply_mapping(filtered_df, 'Road_Surface_Conditions', 'Road Surface')
apply_mapping(filtered_df, 'Special_Conditions_at_Site', 'Special Conditions at Site')
apply_mapping(filtered_df, 'Carriageway_Hazards', 'Carriageway Hazards')
apply_mapping(filtered_df, 'Urban_or_Rural_Area', 'Urban Rural')
apply_mapping(filtered_df, 'Did_Police_Officer_Attend_Scene_of_Accident', 'Police Officer Attend')
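
The mapping pattern used inside apply_mapping can be illustrated on a toy code/label pair (the labels here are illustrative, not read from the data guide):

```python
import pandas as pd

# Toy code-to-label sheet mirroring the structure of the data guide
guide = pd.DataFrame({"Code": [1, 2, 3],
                      "Label": ["Fatal", "Serious", "Slight"]})
mapping_dict = pd.Series(guide["Label"].values,
                         index=guide["Code"].astype(str)).to_dict()

# A toy coded column, mapped to labels and converted to the category dtype
severity = pd.Series([3, 3, 2]).astype(str).map(mapping_dict).astype("category")
print(severity)
```

Casting both sides to string before mapping avoids silent mismatches between integer codes in the data and differently-typed codes in the guide.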

It can also be seen below that the relevant data types have been converted from numerical to categorical.

In [17]:
data_types = filtered_df.dtypes

print(data_types)
Accident_Index                                   object
Location_Easting_OSGR                             int64
Location_Northing_OSGR                            int64
Longitude                                       float64
Latitude                                        float64
Police_Force                                   category
Accident_Severity                              category
Number_of_Vehicles                                int64
Number_of_Casualties                              int64
Date                                             object
Day_of_Week                                    category
Time                                             object
Local_Authority_District                       category
Local_Authority_Highway                        category
1st_Road_Class                                 category
1st_Road_Number                                   int64
Road_Type                                      category
Speed_limit                                       int64
Junction_Detail                                category
Junction_Control                               category
2nd_Road_Class                                 category
2nd_Road_Number                                   int64
Pedestrian_Crossing-Human_Control              category
Pedestrian_Crossing-Physical_Facilities        category
Light_Conditions                               category
Weather_Conditions                             category
Road_Surface_Conditions                        category
Special_Conditions_at_Site                     category
Carriageway_Hazards                            category
Urban_or_Rural_Area                            category
Did_Police_Officer_Attend_Scene_of_Accident    category
LSOA_of_Accident_Location                        object
dtype: object

The output below shows the new data frame after the categorical values have been mapped.

In [18]:
filtered_df.head()
Out[18]:
Accident_Index Location_Easting_OSGR Location_Northing_OSGR Longitude Latitude Police_Force Accident_Severity Number_of_Vehicles Number_of_Casualties Date Day_of_Week Time Local_Authority_District Local_Authority_Highway 1st_Road_Class 1st_Road_Number Road_Type Speed_limit Junction_Detail Junction_Control 2nd_Road_Class 2nd_Road_Number Pedestrian_Crossing-Human_Control Pedestrian_Crossing-Physical_Facilities Light_Conditions Weather_Conditions Road_Surface_Conditions Special_Conditions_at_Site Carriageway_Hazards Urban_or_Rural_Area Did_Police_Officer_Attend_Scene_of_Accident LSOA_of_Accident_Location
7803 200501JI40887 550080 189210 0.164701 51.581719 Metropolitan Police Slight 2 1 09/12/2005 Friday 16:00 Havering Havering A 12 Dual carriageway 30 Crossroads Auto traffic signal C 0 None within 50 metres Pedestrian phase at traffic signal junction Daylight Fine no high winds Wet or damp NaN NaN Urban No E01002329
7855 200501KD00005 551030 189060 0.178337 51.580117 Metropolitan Police Slight 1 1 13/01/2005 Thursday 23:37 Havering Havering A 125 Roundabout 30 Roundabout Give way or uncontrolled A 118 None within 50 metres No physical crossing facilities within 50 metres Darkness - lights lit Fine no high winds Wet or damp NaN NaN Urban Yes E01002243
7856 200501KD00007 554800 188530 0.232471 51.574333 Metropolitan Police Slight 1 1 22/01/2005 Saturday 05:54 Havering Havering Unclassified 0 Single carriageway 30 Not at junction or within 20 metres Data missing or out of range NaN 0 None within 50 metres No physical crossing facilities within 50 metres Darkness - lights lit Fine no high winds Wet or damp NaN NaN Urban Yes E01002271
7857 200501KD00009 552150 183990 0.192297 51.534262 Metropolitan Police Serious 1 1 22/01/2005 Saturday 09:14 Havering Havering Unclassified 0 Single carriageway 30 T or staggered junction Give way or uncontrolled Unclassified 0 None within 50 metres No physical crossing facilities within 50 metres Daylight Fine no high winds Dry NaN NaN Urban Yes E01002372
7858 200501KD00010 555900 191850 0.249805 51.603859 Metropolitan Police Slight 3 5 16/01/2005 Sunday 00:38 Havering Havering A 12 Dual carriageway 50 Not at junction or within 20 metres Data missing or out of range NaN 0 None within 50 metres No physical crossing facilities within 50 metres Darkness - lights lit Fine no high winds Wet or damp NaN NaN Rural Yes E01002296

After further examining the second Excel sheet, it became apparent that empty values were not only represented as null or -1 but also as "Unknown", "Data missing or out of range", etc. (examples from the Excel data guide are shown below). The code below loops through all columns and outputs the number of missing values. This new check (after converting the datatypes) reveals many more missing values which may need to be addressed.

[Screenshots of the Road-Accident-Safety-Data-Guide.xls sheets showing these missing-value codes]

In [19]:
missing_values_indicators = [
    "Data missing or out of range",
    "Unclassified",
    "Unknown",
    "Unallocated"
]

# Initialize a dictionary to hold the count of missing values for each column
missing_or_null_value_counts = {}

# Iterate over each column in the DataFrame
for column in filtered_df.columns:
    # Count the occurrences of each indicator string and NaN values in the column
    # and sum them to get the total count of missing or null values for that column
    missing_count = filtered_df[column].isnull().sum() \
                    + filtered_df[column].astype(str).isin(missing_values_indicators).sum()
    
    # Record the column if any missing or null values were found
    if missing_count > 0:
        missing_or_null_value_counts[column] = missing_count

# All recorded columns have at least one missing value
missing_or_null_value_counts_filtered = missing_or_null_value_counts

# Convert the counts to a DataFrame for easy viewing, if not empty
if missing_or_null_value_counts_filtered:
    missing_counts_df = pd.DataFrame.from_dict(missing_or_null_value_counts_filtered, orient='index', columns=['Missing/Null Value Count'])
else:
    missing_counts_df = pd.DataFrame(columns=['Missing/Null Value Count'])

# Print or save the DataFrame as needed
print(missing_counts_df)
                            Missing/Null Value Count
1st_Road_Class                                   753
Road_Type                                         80
Junction_Control                                1356
2nd_Road_Class                                  2722
Weather_Conditions                                84
Special_Conditions_at_Site                      3955
Carriageway_Hazards                             3966
LSOA_of_Accident_Location                         11

The code below checks for any duplicate rows based on the Accident_Index ID. There are none.

In [20]:
duplicated = filtered_df['Accident_Index'].duplicated().sum()

# If duplicated is greater than 0, there are duplicate values
print(f"Number of duplicate entries in 'Accident_Index': {duplicated}")
Number of duplicate entries in 'Accident_Index': 0
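
The behaviour of duplicated() can be verified on a toy index column (IDs illustrative):

```python
import pandas as pd

# Toy accident index with one repeated ID
ids = pd.Series(["A001", "A002", "A002", "A003"])

# duplicated() marks every repeat after its first occurrence
dup_count = ids.duplicated().sum()
print(dup_count)
```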

2.4 OUTLIER DETECTION

2.4.1. (NUMERICAL) OUTLIER DETECTION (HISTOGRAM)

Based on the output below, there do not appear to be outliers for the speed limit; though the frequency of 20 mph is low, this does not necessarily make it an outlier, as it is likely just uncommon but valid data. The number of casualties does have outliers: it could be said that any row where the number of casualties is greater than 6 is an outlier. This will be addressed later in data preparation.

In [21]:
numeric_cols = ['Speed_limit', 'Number_of_Casualties']

for col in numeric_cols:
    plt.figure(figsize=(6, 3))
    plt.hist(filtered_df[col], bins=40, edgecolor='black')
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()
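
A less visual way to flag numeric outliers is the interquartile-range rule; a minimal sketch on toy data (the 1.5x multiplier is the conventional choice, not a threshold taken from this report):

```python
import pandas as pd

# Toy casualties column with one extreme value
casualties = pd.Series([1, 1, 1, 2, 2, 3, 10])

# Interquartile-range rule: values above Q3 + 1.5*IQR are flagged
q1, q3 = casualties.quantile(0.25), casualties.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = casualties[casualties > upper_fence]
print(outliers)
```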

2.4.2. (CATEGORICAL) OUTLIER DETECTION (FREQUENCY/OCCURRENCES)

Because the data is categorical, it makes sense to identify outliers using frequency counts. Based on the output below, some outliers are very clear: for example, accidents involving 6 or more vehicles are extremely rare. Similarly, it is very uncommon for an accident to involve more than 6 casualties.

In [22]:
# List of variables to calculate frequency
variables_to_count = [ 
    'Police_Force', 'Accident_Severity', 'Number_of_Vehicles', 'Number_of_Casualties', 
    'Day_of_Week', '1st_Road_Class', 
    'Road_Type', 'Speed_limit', 'Junction_Detail', 'Junction_Control', 
    '2nd_Road_Class', 'Pedestrian_Crossing-Human_Control', 
    'Pedestrian_Crossing-Physical_Facilities', 'Light_Conditions', 'Weather_Conditions', 
    'Road_Surface_Conditions', 'Special_Conditions_at_Site', 'Carriageway_Hazards', 
    'Urban_or_Rural_Area', 'Did_Police_Officer_Attend_Scene_of_Accident'
]

# Loop through each variable and print its frequency without the datatype or index name
for variable in variables_to_count:
    frequency = filtered_df[variable].value_counts()
    frequency.index.name = None  # Set the index name to None to avoid printing it
    print(f"Frequency for {variable}:")
    print(frequency.to_string())  # This will print without the index name
    print("\n")  # Adding a new line for better readability
Frequency for Police_Force:
Metropolitan Police    4023


Frequency for Accident_Severity:
Slight     3539
Serious     452
Fatal        32


Frequency for Number_of_Vehicles:
2     2592
1     1036
3      313
4       64
5       13
8        2
6        1
14       1
10       1


Frequency for Number_of_Casualties:
1     3115
2      663
3      169
4       43
5       19
6        8
7        4
10       1
9        1


Frequency for Day_of_Week:
Friday       639
Tuesday      610
Wednesday    599
Saturday     594
Thursday     585
Monday       566
Sunday       430


Frequency for 1st_Road_Class:
A               1460
C               1255
Unclassified     753
B                305
Motorway         250


Frequency for Road_Type:
Single carriageway    2675
Dual carriageway       826
Roundabout             376
Unknown                 80
Slip road               35
One way street          31


Frequency for Speed_limit:
30    3221
70     322
50     295
40     121
60      59
20       5


Frequency for Junction_Detail:
T or staggered junction                1387
Not at junction or within 20 metres    1356
Crossroads                              460
Roundabout                              384
Private drive or entrance               124
Mini-roundabout                          99
Other junction                           81
Slip road                                78
More than 4 arms (not roundabout)        54


Frequency for Junction_Control:
Give way or uncontrolled        2114
Data missing or out of range    1356
Auto traffic signal              528
Stop sign                         16
Authorised person                  9


Frequency for 2nd_Road_Class:
Unclassified    1366
A                620
C                522
B                118
Motorway          41


Frequency for Pedestrian_Crossing-Human_Control:
None within 50 metres                 4009
Control by school crossing patrol        8
Control by other authorised person       6


Frequency for Pedestrian_Crossing-Physical_Facilities:
No physical crossing facilities within 50 metres                             3345
Pedestrian phase at traffic signal junction                                   352
Pelican, puffin, toucan or similar non-junction pedestrian light crossing     144
Zebra                                                                         140
Central refuge                                                                 23
Footbridge or subway                                                           19


Frequency for Light_Conditions:
Daylight                       2956
Darkness - lights lit          1013
Darkness - lighting unknown      23
Darkness - no lighting           19
Darkness - lights unlit          12


Frequency for Weather_Conditions:
Fine no high winds       3400
Raining no high winds     364
Unknown                    84
Other                      62
Snowing no high winds      41
Fine + high winds          32
Raining + high winds       23
Fog or mist                14
Snowing + high winds        3


Frequency for Road_Surface_Conditions:
Dry                     3229
Wet or damp              708
Frost or ice              52
Snow                      32
Flood over 3cm. deep       2


Frequency for Special_Conditions_at_Site:
Roadworks                                     34
Auto traffic signal - out                     15
Oil or diesel                                  9
Road surface defective                         5
Road sign or marking defective or obscured     3
Auto signal part defective                     2


Frequency for Carriageway_Hazards:
Other object on road                               22
Any animal in carriageway (except ridden horse)    20
Previous accident                                   6
Pedestrian in carriageway - not injured             5
Vehicle load on road                                4


Frequency for Urban_or_Rural_Area:
Urban    3263
Rural     760


Frequency for Did_Police_Officer_Attend_Scene_of_Accident:
Yes                                                                         3181
No                                                                           836
No - accident was reported using a self completion  form (self rep only)       6
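
Rare categories such as those above can also be flagged programmatically with a frequency threshold; a minimal sketch on toy data (the 2% cut-off is an illustrative choice, not one used in this report):

```python
import pandas as pd

# Toy categorical column with one very rare category (values illustrative)
weather = pd.Series(["Fine"] * 95 + ["Rain"] * 4 + ["Fog"])

# Relative frequency of each category
freq = weather.value_counts(normalize=True)

# Flag categories seen in under 2% of rows as potential outliers
rare = freq[freq < 0.02].index.tolist()
print(rare)
```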


2.4.3. (CATEGORICAL) OUTLIER DETECTION (BARCHARTS)

The code below graphs all categorical columns against the number of casualties as a means of identifying categorical outliers.

In [23]:
# Columns to exclude from the bar charts
exclude_columns = [
    'Number_of_Casualties', 'Accident_Index', 'Location_Easting_OSGR', 
    'Location_Northing_OSGR', 'Longitude', 'Latitude', 'Date', 'Time', 'Police_Force', 'Local_Authority_District', 'Local_Authority_Highway', 'LSOA_of_Accident_Location'
]
for column in filtered_df.columns:
    if column not in exclude_columns:
        group_data = filtered_df.groupby(column, observed=False)['Number_of_Casualties'].sum().sort_values()  # observed=False keeps current behaviour and silences the pandas FutureWarning

        # Create a bar chart
        group_data.plot(kind='bar', figsize=(3, 1))

        # Set the title and labels
        plt.title(f'Number of Casualties by {column}')
        plt.xlabel(column)
        plt.ylabel('Number_of_Casualties')

        # Show the plot
        plt.show()
C:\Users\frosty\AppData\Local\Temp\ipykernel_14668\2066257647.py:8: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  group_data = filtered_df.groupby(column)['Number_of_Casualties'].sum().sort_values()

2.4.4. CLASS IMBALANCE PERCENTAGE OUTPUT

The following code outputs the percentage share of each class in a column. This is useful when scoping the business questions, as it is not recommended to use columns where more than 20% of the data is missing.

In [24]:
# Columns to exclude
exclude_columns = ['Accident_Index', 'LSOA_of_Accident_Location', 'Time', 'Date']

# Loop through categorical columns except the excluded ones and calculate class percentages
for column in filtered_df.select_dtypes(include=['object', 'category']).columns:
    if column not in exclude_columns:
        # Get the frequency of each class
        class_distribution = filtered_df[column].value_counts()
        
        # Calculate the percentage of each class
        class_percentage = (class_distribution / len(filtered_df)) * 100
        
        print(f"Class percentage for {column}:\n{class_percentage}\n")
        
        # Optionally, you can set a threshold to flag the imbalance, like 80/20 rule.
        if class_percentage.max() > 80:
            print(f"{column} is imbalanced.\n")
Class percentage for Police_Force:
Police_Force
Metropolitan Police    100.0
Name: count, dtype: float64

Police_Force is imbalanced.

Class percentage for Accident_Severity:
Accident_Severity
Slight     87.969177
Serious    11.235396
Fatal       0.795426
Name: count, dtype: float64

Accident_Severity is imbalanced.

Class percentage for Day_of_Week:
Day_of_Week
Friday       15.883669
Tuesday      15.162814
Wednesday    14.889386
Saturday     14.765101
Thursday     14.541387
Monday       14.069103
Sunday       10.688541
Name: count, dtype: float64

Class percentage for Local_Authority_District:
Local_Authority_District
Havering    100.0
Name: count, dtype: float64

Local_Authority_District is imbalanced.

Class percentage for Local_Authority_Highway:
Local_Authority_Highway
Havering    100.0
Name: count, dtype: float64

Local_Authority_Highway is imbalanced.

Class percentage for 1st_Road_Class:
1st_Road_Class
A               36.291325
C               31.195625
Unclassified    18.717375
B                7.581407
Motorway         6.214268
Name: count, dtype: float64

Class percentage for Road_Type:
Road_Type
Single carriageway    66.492667
Dual carriageway      20.531941
Roundabout             9.346259
Unknown                1.988566
Slip road              0.869998
One way street         0.770569
Name: count, dtype: float64

Class percentage for Junction_Detail:
Junction_Detail
T or staggered junction                34.476759
Not at junction or within 20 metres    33.706189
Crossroads                             11.434253
Roundabout                              9.545116
Private drive or entrance               3.082277
Mini-roundabout                         2.460850
Other junction                          2.013423
Slip road                               1.938852
More than 4 arms (not roundabout)       1.342282
Name: count, dtype: float64

Class percentage for Junction_Control:
Junction_Control
Give way or uncontrolled        52.547850
Data missing or out of range    33.706189
Auto traffic signal             13.124534
Stop sign                        0.397713
Authorised person                0.223714
Name: count, dtype: float64

Class percentage for 2nd_Road_Class:
2nd_Road_Class
Unclassified    33.954760
A               15.411385
C               12.975391
B                2.933134
Motorway         1.019140
Name: count, dtype: float64

Class percentage for Pedestrian_Crossing-Human_Control:
Pedestrian_Crossing-Human_Control
None within 50 metres                 99.652001
Control by school crossing patrol      0.198857
Control by other authorised person     0.149142
Name: count, dtype: float64

Pedestrian_Crossing-Human_Control is imbalanced.

Class percentage for Pedestrian_Crossing-Physical_Facilities:
Pedestrian_Crossing-Physical_Facilities
No physical crossing facilities within 50 metres                             83.146905
Pedestrian phase at traffic signal junction                                   8.749689
Pelican, puffin, toucan or similar non-junction pedestrian light crossing     3.579418
Zebra                                                                         3.479990
Central refuge                                                                0.571713
Footbridge or subway                                                          0.472284
Name: count, dtype: float64

Pedestrian_Crossing-Physical_Facilities is imbalanced.

Class percentage for Light_Conditions:
Light_Conditions
Daylight                       73.477504
Darkness - lights lit          25.180214
Darkness - lighting unknown     0.571713
Darkness - no lighting          0.472284
Darkness - lights unlit         0.298285
Name: count, dtype: float64

Class percentage for Weather_Conditions:
Weather_Conditions
Fine no high winds       84.514044
Raining no high winds     9.047974
Unknown                   2.087994
Other                     1.541138
Snowing no high winds     1.019140
Fine + high winds         0.795426
Raining + high winds      0.571713
Fog or mist               0.347999
Snowing + high winds      0.074571
Name: count, dtype: float64

Weather_Conditions is imbalanced.

Class percentage for Road_Surface_Conditions:
Road_Surface_Conditions
Dry                     80.263485
Wet or damp             17.598807
Frost or ice             1.292568
Snow                     0.795426
Flood over 3cm. deep     0.049714
Name: count, dtype: float64

Road_Surface_Conditions is imbalanced.

Class percentage for Special_Conditions_at_Site:
Special_Conditions_at_Site
Roadworks                                     0.845140
Auto traffic signal - out                     0.372856
Oil or diesel                                 0.223714
Road surface defective                        0.124285
Road sign or marking defective or obscured    0.074571
Auto signal part defective                    0.049714
Name: count, dtype: float64

Class percentage for Carriageway_Hazards:
Carriageway_Hazards
Other object on road                               0.546856
Any animal in carriageway (except ridden horse)    0.497141
Previous accident                                  0.149142
Pedestrian in carriageway - not injured            0.124285
Vehicle load on road                               0.099428
Name: count, dtype: float64

Class percentage for Urban_or_Rural_Area:
Urban_or_Rural_Area
Urban    81.108625
Rural    18.891375
Name: count, dtype: float64

Urban_or_Rural_Area is imbalanced.

Class percentage for Did_Police_Officer_Attend_Scene_of_Accident:
Did_Police_Officer_Attend_Scene_of_Accident
Yes                                                                         79.070346
No                                                                          20.780512
No - accident was reported using a self completion  form (self rep only)     0.149142
Name: count, dtype: float64

In [25]:
# Define ANSI escape codes for red color and bold text
RED = '\033[91m'
BOLD = '\033[1m'
RESET = '\033[0m'
# Columns to exclude
exclude_columns = ['Accident_Index', 'LSOA_of_Accident_Location', 'Time', 'Date']

# Loop through categorical columns except the excluded ones and calculate class percentages
for column in filtered_df.select_dtypes(include=['object', 'category']).columns:
    if column not in exclude_columns:
        # Get the frequency of each class
        class_distribution = filtered_df[column].value_counts()
        # Calculate the percentage of each class
        class_percentage = (class_distribution / len(filtered_df)) * 100
        #threshold to flag the imbalance, 80/20 rule.
        if class_percentage.max() > 80:
            print(f"{RED}{BOLD}Warning: {column} is highly imbalanced.{RESET}\n")
Warning: Police_Force is highly imbalanced.

Warning: Accident_Severity is highly imbalanced.

Warning: Local_Authority_District is highly imbalanced.

Warning: Local_Authority_Highway is highly imbalanced.

Warning: Pedestrian_Crossing-Human_Control is highly imbalanced.

Warning: Pedestrian_Crossing-Physical_Facilities is highly imbalanced.

Warning: Weather_Conditions is highly imbalanced.

Warning: Road_Surface_Conditions is highly imbalanced.

Warning: Urban_or_Rural_Area is highly imbalanced.
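The 20% missing-data rule mentioned above can be sketched as a small helper. This is an illustrative function (hypothetical `missing_percentage`, not part of the original notebook), assuming missing values are coded with indicator labels such as "Unknown":

```python
import pandas as pd

# Hypothetical helper: share of rows per column that carry a
# missing-value indicator label
def missing_percentage(df, indicators=("Unknown", "Data missing or out of range")):
    return df.isin(indicators).sum() / len(df) * 100

# Tiny illustrative frame: 2 of 10 weather rows are coded "Unknown"
demo = pd.DataFrame({
    'Weather_Conditions': ['Fine no high winds'] * 8 + ['Unknown'] * 2,
    'Road_Type': ['Single carriageway'] * 10,
})
pct = missing_percentage(demo)
print(pct)
```

A column would only be excluded when its percentage exceeds the 20% threshold; here Weather_Conditions sits at exactly 20% and so remains usable.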

UNIQUE VALUE COUNTS AND NUMERIC RANGES
In [26]:
for column in filtered_df.columns:
    if pd.api.types.is_numeric_dtype(filtered_df[column]):
        # It's a numeric column, so we can calculate min and max
        min_value = filtered_df[column].min()
        max_value = filtered_df[column].max()
        print(f"{column} range: {min_value} to {max_value}")
    else:
        # It's not numeric, so we count unique values instead
        unique_values = filtered_df[column].nunique()
        print(f"{column} has {unique_values} unique values")
Accident_Index has 4023 unique values
Location_Easting_OSGR range: 548990 to 561130
Location_Northing_OSGR range: 180220 to 194700
Longitude range: 0.148362 to 0.322103
Latitude range: 51.499662 to 51.62995
Police_Force has 1 unique values
Accident_Severity has 3 unique values
Number_of_Vehicles range: 1 to 14
Number_of_Casualties range: 1 to 10
Date has 1805 unique values
Day_of_Week has 7 unique values
Time has 997 unique values
Local_Authority_District has 1 unique values
Local_Authority_Highway has 1 unique values
1st_Road_Class has 5 unique values
1st_Road_Number range: 0 to 1459
Road_Type has 6 unique values
Speed_limit range: 20 to 70
Junction_Detail has 9 unique values
Junction_Control has 5 unique values
2nd_Road_Class has 5 unique values
2nd_Road_Number range: 0 to 1459
Pedestrian_Crossing-Human_Control has 3 unique values
Pedestrian_Crossing-Physical_Facilities has 6 unique values
Light_Conditions has 5 unique values
Weather_Conditions has 9 unique values
Road_Surface_Conditions has 5 unique values
Special_Conditions_at_Site has 6 unique values
Carriageway_Hazards has 5 unique values
Urban_or_Rural_Area has 2 unique values
Did_Police_Officer_Attend_Scene_of_Accident has 3 unique values
LSOA_of_Accident_Location has 153 unique values

Script to examine whether any value lies more than 3 standard deviations from the mean for the variables Speed_limit and Number_of_Casualties

In [27]:
# Function to detect outliers based on z-score for numeric columns only
def detect_outliers_z_score(data, column):
    if pd.api.types.is_numeric_dtype(data[column]):
        mean_val = data[column].mean()
        std_val = data[column].std()
        threshold = 3

        # Find outliers
        outliers = data[abs(data[column] - mean_val) > threshold * std_val][column]
        return outliers
    else:
        return pd.Series()  # Return an empty series for non-numeric columns

# Apply the function to each numeric column and store results
outliers_dict = {}
for column in filtered_df:
    if pd.api.types.is_numeric_dtype(filtered_df[column]):
        outliers_dict[column] = detect_outliers_z_score(filtered_df, column)

# Create a DataFrame from the dictionary of outliers
outliers_df = pd.DataFrame.from_dict(outliers_dict, orient='index').transpose()
print(outliers_df)
        Location_Easting_OSGR  Longitude  Number_of_Vehicles  Number_of_Casualties  1st_Road_Number  2nd_Road_Number
34445                561130.0   0.322103                 NaN                   NaN              NaN              NaN
58984                560440.0   0.313329                 NaN                   NaN              NaN              NaN
82519                560140.0        NaN                 NaN                   NaN              NaN              NaN
82809                560190.0        NaN                 NaN                   NaN              NaN              NaN
7870                      NaN        NaN                 4.0                   NaN              NaN              NaN
...                       ...        ...                 ...                   ...              ...              ...
128740                    NaN        NaN                 NaN                   NaN              NaN           1459.0
128743                    NaN        NaN                 NaN                   NaN              NaN           1306.0
128762                    NaN        NaN                 NaN                   NaN              NaN           1306.0
128784                    NaN        NaN                 NaN                   NaN              NaN           1459.0
128907                    NaN        NaN                 NaN                   NaN              NaN           1335.0

[446 rows x 6 columns]
In [28]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Cluster accidents on key categorical attributes (one-hot encoded, then PCA-reduced)

categorical_columns = [
    'Weather_Conditions', 'Road_Surface_Conditions', 'Light_Conditions',
    '1st_Road_Class', 'Urban_or_Rural_Area', 'Junction_Control',
    'Pedestrian_Crossing-Physical_Facilities', 'Road_Type', 'Accident_Severity'
]

# One-hot encode these categorical columns
encoder = OneHotEncoder()
encoded_categorical = encoder.fit_transform(df[categorical_columns]).toarray()

# Apply PCA to reduce dimensions for visualization
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(encoded_categorical)

# Apply k-means clustering
kmeans = KMeans(n_clusters=5)  # Adjust the number of clusters as needed
kmeans.fit(reduced_data)

# Create a scatter plot of the reduced data with the cluster labels
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=kmeans.labels_, cmap='viridis')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title('Clusters from k-means')

# Create a color key (legend) for the clusters
unique_labels = set(kmeans.labels_)
for label in unique_labels:
    plt.scatter([], [], color=plt.cm.viridis(label / max(kmeans.labels_)), label=f'Cluster {label + 1}')
plt.legend()

plt.show()
c:\Users\frosty\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\cluster\_kmeans.py:1416: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
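The FutureWarning above can be silenced by passing `n_init` explicitly, as the warning itself suggests. A minimal sketch on toy data (a fixed `random_state` is an added assumption to make runs repeatable):

```python
import numpy as np
from sklearn.cluster import KMeans

# Setting n_init explicitly keeps behaviour stable across scikit-learn
# versions; random_state makes the clustering repeatable
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)
```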

3. DATA PREPARATION

THE DATA TYPES WERE CONVERTED TO CATEGORICAL IN DATA UNDERSTANDING, AS PER THE RECOMMENDATION OF DAQING CHEN

CONVERTING DATATYPES (REPEAT)

The code below imports the Road-Accident-Safety-Data-Guide.xls Excel file, which contains the code-to-label mappings for the categorical columns in the initial (AccidentLondonBoroughs2223.csv) file. It then maps the coded values to their labels and converts the data type. It applies this to each column by taking the column name together with its respective sheet (in Road-Accident-Safety-Data-Guide.xls), which contains the code and its label.

In [ ]:
data_guide_path = 'Road-Accident-Safety-Data-Guide.xls'
data_guide = pd.read_excel(data_guide_path, sheet_name=None)

# Function to apply the mapping from the data guide to the dataframe
def apply_mapping(df, column, sheet_name):
    # Choose the correct code column ('code' or 'Code')
    code_column = 'code' if 'code' in data_guide[sheet_name].columns else 'Code'

    # Choose the correct label column ('label' or 'Label')
    label_column = 'label' if 'label' in data_guide[sheet_name].columns else 'Label'

    # Create a mapping dictionary from the 'code'/'Code' to the 'label'/'Label'
    mapping_dict = pd.Series(data_guide[sheet_name][label_column].values, index=data_guide[sheet_name][code_column].astype(str)).to_dict()

    # Apply the mapping to the dataframe column
    df[column] = df[column].astype(str).map(mapping_dict).astype('category')


# Apply the mapping for each specified column
apply_mapping(filtered_df, 'Police_Force', 'Police Force')
apply_mapping(filtered_df, 'Accident_Severity', 'Accident Severity')
apply_mapping(filtered_df, 'Day_of_Week', 'Day of Week')
apply_mapping(filtered_df, 'Local_Authority_District', 'Local Authority (District)')
apply_mapping(filtered_df, 'Local_Authority_Highway', 'Local Authority (Highway)')
apply_mapping(filtered_df, '1st_Road_Class', '1st Road Class')
apply_mapping(filtered_df, 'Road_Type', 'Road Type')
apply_mapping(filtered_df, 'Junction_Detail', 'Junction Detail')
apply_mapping(filtered_df, 'Junction_Control', 'Junction Control')
apply_mapping(filtered_df, '2nd_Road_Class', '2nd Road Class')
apply_mapping(filtered_df, 'Pedestrian_Crossing-Human_Control', 'Ped Cross - Human')
apply_mapping(filtered_df, 'Pedestrian_Crossing-Physical_Facilities', 'Ped Cross - Physical')
apply_mapping(filtered_df, 'Light_Conditions', 'Light Conditions')
apply_mapping(filtered_df, 'Weather_Conditions', 'Weather')
apply_mapping(filtered_df, 'Road_Surface_Conditions', 'Road Surface')
apply_mapping(filtered_df, 'Special_Conditions_at_Site', 'Special Conditions at Site')
apply_mapping(filtered_df, 'Carriageway_Hazards', 'Carriageway Hazards')
apply_mapping(filtered_df, 'Urban_or_Rural_Area', 'Urban Rural')
apply_mapping(filtered_df, 'Did_Police_Officer_Attend_Scene_of_Accident', 'Police Officer Attend')

The code below drops all columns not relevant to the business questions stated in section 1.

In [29]:
columns_to_keep = [
    'Weather_Conditions', 'Road_Surface_Conditions', 'Light_Conditions',
    'Speed_limit', 'Urban_or_Rural_Area', 'Junction_Detail',
    'Pedestrian_Crossing-Physical_Facilities', 'Road_Type', 'Accident_Severity',
    'Number_of_Casualties'
]
new_df = filtered_df[columns_to_keep].copy()

It can be observed that two of the columns required for the business problem contain missing values. Through the class percentage checks above, it was determined that no more than 20% of the values in each column were missing, meaning it is safe to use these columns.

In [30]:
missing_values_indicators = [
    "Data missing or out of range",
    "Unclassified",
    "Unknown",
    "Unallocated"
]

# Initialize a dictionary to hold the count of missing values for each column
missing_or_null_value_counts = {}

# Iterate over each column in the DataFrame
for column in new_df.columns:
    # Count the occurrences of each indicator string and NaN values in the column
    # and sum them to get the total count of missing or null values for that column
    missing_count = new_df[column].isnull().sum() \
                    + new_df[column].astype(str).isin(missing_values_indicators).sum()
    
    # Add to the dictionary only if the count is greater than 1
    if missing_count > 1:
        missing_or_null_value_counts[column] = missing_count

# Filter the results to only include columns with more than one missing value
missing_or_null_value_counts_filtered = {k: v for k, v in missing_or_null_value_counts.items() if v > 1}

# Convert the counts to a DataFrame for easy viewing, if not empty
if missing_or_null_value_counts_filtered:
    missing_counts_df = pd.DataFrame.from_dict(missing_or_null_value_counts_filtered, orient='index', columns=['Missing/Null Value Count'])
else:
    missing_counts_df = pd.DataFrame(columns=['Missing/Null Value Count'])

# Print or save the DataFrame as needed
print(missing_counts_df)
# missing_counts_df.to_csv('missing_or_null_value_counts.csv')
                    Missing/Null Value Count
Weather_Conditions                        84
Road_Type                                 80

3.1. DETERMINING MODE

The code below outputs the frequencies for the two columns containing missing values, allowing us to impute them appropriately. Replacing missing values in proportion to the observed class percentages (so that the overall distribution is preserved) was considered. However, after further consideration, this method would have much the same effect as simply ignoring the missing values, so it was decided to impute with the mode instead.

In [31]:
# List of variables to calculate frequency
variables_to_count = [
    'Weather_Conditions', 'Road_Type'
    
]

# Loop through each variable and print its frequency with value descriptions
for variable in variables_to_count:
    frequency = new_df[variable].value_counts().reset_index()
    frequency.columns = ['Value', 'Frequency']  # Rename columns for clarity
    print(f"Frequency for {variable}:")
    for row in frequency.itertuples(index=False):
        print(f"{row.Value} = {row.Frequency}")
    print("\n")  # Adding a new line for better readability
Frequency for Weather_Conditions:
Fine no high winds = 3400
Raining no high winds = 364
Unknown = 84
Other = 62
Snowing no high winds = 41
Fine + high winds = 32
Raining + high winds = 23
Fog or mist = 14
Snowing + high winds = 3


Frequency for Road_Type:
Single carriageway = 2675
Dual carriageway = 826
Roundabout = 376
Unknown = 80
Slip road = 35
One way street = 31
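For comparison, the proportional approach considered (and rejected) above could look something like the sketch below. This is a hypothetical `impute_proportionally` helper, not the method actually used: it redistributes the "Unknown" rows according to the observed class shares instead of assigning them all to the mode.

```python
import numpy as np
import pandas as pd

# Hypothetical helper: replace a missing-value label by sampling from the
# observed class distribution, so class proportions are roughly preserved
def impute_proportionally(series, missing_label, seed=0):
    rng = np.random.default_rng(seed)
    observed = series[series != missing_label]
    shares = observed.value_counts(normalize=True)
    mask = series == missing_label
    out = series.copy()
    out[mask] = rng.choice(shares.index, size=mask.sum(), p=shares.values)
    return out

# Toy series: 8 Fine, 2 Raining, 2 Unknown
weather = pd.Series(['Fine'] * 8 + ['Raining'] * 2 + ['Unknown'] * 2)
imputed = impute_proportionally(weather, 'Unknown')
print(imputed.value_counts())
```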


3.2. REPLACING MISSING VALUES USING MODE

In [32]:
new_df['Weather_Conditions'] = new_df['Weather_Conditions'].replace("Unknown", "Fine no high winds")
In [33]:
new_df['Road_Type'] = new_df['Road_Type'].replace("Unknown", "Single carriageway")

RECHECKING IF MISSING VALUES FILLED

After filling in the missing values, it can be seen that none remain and that the commands above executed as expected.

In [34]:
missing_values_indicators = [
    "Data missing or out of range",
    "Unclassified",
    "Unknown",
    "Unallocated"
]

# Initialize a dictionary to hold the count of missing values for each column
missing_or_null_value_counts = {}

# Iterate over each column in the DataFrame
for column in new_df.columns:
    # Count the occurrences of each indicator string and NaN values in the column
    # and sum them to get the total count of missing or null values for that column
    missing_count = new_df[column].isnull().sum() \
                    + new_df[column].astype(str).isin(missing_values_indicators).sum()
    
    # Add to the dictionary only if the count is greater than 1
    if missing_count > 1:
        missing_or_null_value_counts[column] = missing_count

# Filter the results to only include columns with more than one missing value
missing_or_null_value_counts_filtered = {k: v for k, v in missing_or_null_value_counts.items() if v > 1}

# Convert the counts to a DataFrame for easy viewing, if not empty
if missing_or_null_value_counts_filtered:
    missing_counts_df = pd.DataFrame.from_dict(missing_or_null_value_counts_filtered, orient='index', columns=['Missing/Null Value Count'])
else:
    missing_counts_df = pd.DataFrame(columns=['Missing/Null Value Count'])

# Print or save the DataFrame as needed
print(missing_counts_df)
# missing_counts_df.to_csv('missing_or_null_value_counts.csv')
Empty DataFrame
Columns: [Missing/Null Value Count]
Index: []

3.3. DEALING WITH OUTLIERS (NUMERICAL)

In order to find outliers in the numerical columns, the code below plots a histogram of each one.

In [35]:
numeric_cols = ['Speed_limit', 'Number_of_Casualties']

for col in numeric_cols:
    plt.figure(figsize=(5, 2))
    plt.hist(filtered_df[col], bins=40, edgecolor='black')
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

The code below outputs the count of each value for easier readability.

In [36]:
# List of variables to calculate frequency
variables_to_count = [
    'Number_of_Casualties'
    
]

# Loop through each variable and print its frequency with value descriptions
for variable in variables_to_count:
    frequency = new_df[variable].value_counts().reset_index()
    frequency.columns = ['Value', 'Frequency']  # Rename columns for clarity
    print(f"Frequency for {variable}:")
    for row in frequency.itertuples(index=False):
        print(f"{row.Value} = {row.Frequency}")
    print("\n")  # Adding a new line for better readability
Frequency for Number_of_Casualties:
1 = 3115
2 = 663
3 = 169
4 = 43
5 = 19
6 = 8
7 = 4
10 = 1
9 = 1


The rows where the number of casualties is 8 or more have been dropped. Although other columns contain categories with low frequencies (e.g. snow), that does not necessarily mean they are outliers; they could simply be uncommon occurrences. Another example of this is the 20 mph speed limit: it is rare, but not an outlier.

In [37]:
# Keep only rows where 'Number_of_Casualties' is less than 8
new_df = new_df[new_df['Number_of_Casualties'] < 8]

After running the count loop again, I was able to confirm that the rows were dropped as expected.

In [38]:
# List of variables to calculate frequency
variables_to_count = [
    'Number_of_Casualties'
    
]

# Loop through each variable and print its frequency with value descriptions
for variable in variables_to_count:
    frequency = new_df[variable].value_counts().reset_index()
    frequency.columns = ['Value', 'Frequency']  # Rename columns for clarity
    print(f"Frequency for {variable}:")
    for row in frequency.itertuples(index=False):
        print(f"{row.Value} = {row.Frequency}")
    print("\n")  # Adding a new line for better readability
Frequency for Number_of_Casualties:
1 = 3115
2 = 663
3 = 169
4 = 43
5 = 19
6 = 8
7 = 4
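The rare-category point made above (e.g. snow, or the 20 mph limit) can be made operational with a small frequency check that only flags low-count values for review rather than dropping them. This is a hypothetical `rare_categories` helper, sketched for illustration:

```python
import pandas as pd

# Hypothetical helper: return values whose frequency falls below
# min_count, for manual review rather than automatic removal
def rare_categories(series, min_count=5):
    counts = series.value_counts()
    return counts[counts < min_count]

# Toy casualty counts: 9 and 10 each occur once
casualties = pd.Series([1] * 50 + [2] * 10 + [9, 10])
rare = rare_categories(casualties)
print(rare)
```

Whether a flagged value is then treated as an outlier (like 9+ casualties) or kept as a rare-but-valid category (like snow) remains an analyst's judgment call.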


3.3. DEALING WITH OUTLIERS (NUMERICAL) - Script to output any values that lie more than 3 standard deviations from the mean of each numeric variable

In [39]:
# Calculate Z-scores of each column in the DataFrame
z_scores = stats.zscore(new_df.select_dtypes(include=[np.number]))

# Convert Z-scores to a DataFrame
z_scores_df = pd.DataFrame(z_scores, columns=new_df.select_dtypes(include=[np.number]).columns)

# Define a threshold for identifying outliers
threshold = 3

# Find where Z-scores are above the threshold
outliers = (z_scores_df.abs() > threshold).any(axis=1)

# Print the rows in the DataFrame that contain outliers
outliers_df = new_df[outliers]
display(outliers_df)
Weather_Conditions Road_Surface_Conditions Light_Conditions Speed_limit Urban_or_Rural_Area Junction_Detail Pedestrian_Crossing-Physical_Facilities Road_Type Accident_Severity Number_of_Casualties
7858 Fine no high winds Wet or damp Darkness - lights lit 50 Rural Not at junction or within 20 metres No physical crossing facilities within 50 metres Dual carriageway Slight 5
8147 Fine no high winds Dry Daylight 70 Rural Not at junction or within 20 metres No physical crossing facilities within 50 metres Dual carriageway Slight 7
8237 Fine no high winds Dry Daylight 70 Rural Not at junction or within 20 metres No physical crossing facilities within 50 metres Dual carriageway Serious 4
8297 Fine no high winds Dry Darkness - lights lit 60 Rural T or staggered junction No physical crossing facilities within 50 metres Dual carriageway Slight 4
8348 Fine no high winds Dry Daylight 30 Urban T or staggered junction No physical crossing facilities within 50 metres Single carriageway Slight 4
... ... ... ... ... ... ... ... ... ... ...
128669 Fine no high winds Dry Darkness - lights lit 30 Rural Not at junction or within 20 metres No physical crossing facilities within 50 metres Single carriageway Slight 4
128695 Fine no high winds Dry Daylight 30 Urban T or staggered junction No physical crossing facilities within 50 metres Single carriageway Slight 4
128759 Fine no high winds Dry Darkness - lights lit 30 Urban T or staggered junction No physical crossing facilities within 50 metres Single carriageway Slight 7
128837 Fine no high winds Dry Daylight 30 Urban T or staggered junction Pedestrian phase at traffic signal junction Single carriageway Slight 5
128858 Fine no high winds Dry Darkness - lights lit 30 Urban Crossroads Pedestrian phase at traffic signal junction Single carriageway Slight 4

74 rows × 10 columns

I then dropped any rows (see below) that fit the criteria above

In [51]:
new_df = new_df[~outliers]

ENSURING THE ROWS HAVE BEEN DROPPED

It can be seen below that this was done successfully.

In [52]:
# Calculate Z-scores of each column in the DataFrame
z_scores = stats.zscore(new_df.select_dtypes(include=[np.number]))

# Convert Z-scores to a DataFrame
z_scores_df = pd.DataFrame(z_scores, columns=new_df.select_dtypes(include=[np.number]).columns)

# Define a threshold for identifying outliers
threshold = 3

# Find where Z-scores are above the threshold
outliers = (z_scores_df.abs() > threshold).any(axis=1)

# Print the rows in the DataFrame that contain outliers
outliers_df = new_df[outliers]
display(outliers_df)
Weather_Conditions Road_Surface_Conditions Light_Conditions Speed_limit Urban_or_Rural_Area Junction_Detail Pedestrian_Crossing-Physical_Facilities Road_Type Accident_Severity Number_of_Casualties
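
The 3-standard-deviation rule used above can be illustrated on a toy sample. The numbers below are hypothetical, not from the Havering data: a single extreme value among otherwise identical values is the only point flagged.

```python
import numpy as np
from scipy import stats

# Toy sample: 19 typical values and one extreme value (hypothetical numbers)
sample = np.array([1] * 19 + [50], dtype=float)

# Same rule as above: flag values more than 3 standard deviations from the mean
z = np.abs(stats.zscore(sample))
outlier_mask = z > 3

print(outlier_mask.sum())  # only the extreme value is flagged
```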

The code below outputs any imbalanced classes along with their percentages, using a threshold of 80%. The second code snippet lists the imbalanced columns in a human-readable format

In [42]:
# Assuming 'new_df' is your DataFrame and it's already loaded with data

# Loop through categorical columns and calculate class percentages
for column in new_df.select_dtypes(include=['object', 'category']).columns:
    # Get the frequency of each class
    class_distribution = new_df[column].value_counts()
    
    # Calculate the percentage of each class
    class_percentage = (class_distribution / len(new_df)) * 100
    
    # Print the class percentages
    print(f"Class percentage for {column}:\n{class_percentage}\n")
    
    # Optionally, you can set a threshold to flag the imbalance
    if class_percentage.max() > 80:
        print(f"Warning: {column} is imbalanced. The majority class '{class_distribution.idxmax()}' accounts for {class_percentage.max()}% of the data. \n\n\n")
Class percentage for Weather_Conditions:
Weather_Conditions
Fine no high winds       86.648087
Raining no high winds     9.019508
Other                     1.545478
Snowing no high winds     1.013428
Fine + high winds         0.785407
Raining + high winds      0.582721
Fog or mist               0.329364
Snowing + high winds      0.076007
Name: count, dtype: float64

Warning: Weather_Conditions is imbalanced. The majority class 'Fine no high winds' accounts for 86.64808715480112% of the data. 



Class percentage for Road_Surface_Conditions:
Road_Surface_Conditions
Dry                     80.339498
Wet or damp             17.481632
Frost or ice             1.317456
Snow                     0.810742
Flood over 3cm. deep     0.050671
Name: count, dtype: float64

Warning: Road_Surface_Conditions is imbalanced. The majority class 'Dry' accounts for 80.33949835317962% of the data. 



Class percentage for Light_Conditions:
Light_Conditions
Daylight                       73.828224
Darkness - lights lit          24.854320
Darkness - lighting unknown     0.582721
Darkness - no lighting          0.430707
Darkness - lights unlit         0.304028
Name: count, dtype: float64

Class percentage for Urban_or_Rural_Area:
Urban_or_Rural_Area
Urban    81.606283
Rural    18.393717
Name: count, dtype: float64

Warning: Urban_or_Rural_Area is imbalanced. The majority class 'Urban' accounts for 81.60628325310361% of the data. 



Class percentage for Junction_Detail:
Junction_Detail
T or staggered junction                34.532556
Not at junction or within 20 metres    33.671143
Crossroads                             11.426400
Roundabout                              9.627565
Private drive or entrance               3.116291
Mini-roundabout                         2.432227
Other junction                          2.001520
Slip road                               1.824170
More than 4 arms (not roundabout)       1.368128
Name: count, dtype: float64

Class percentage for Pedestrian_Crossing-Physical_Facilities:
Pedestrian_Crossing-Physical_Facilities
No physical crossing facilities within 50 metres                             83.126425
Pedestrian phase at traffic signal junction                                   8.690144
Pelican, puffin, toucan or similar non-junction pedestrian light crossing     3.572333
Zebra                                                                         3.546998
Central refuge                                                                0.582721
Footbridge or subway                                                          0.481378
Name: count, dtype: float64

Warning: Pedestrian_Crossing-Physical_Facilities is imbalanced. The majority class 'No physical crossing facilities within 50 metres' accounts for 83.12642513301242% of the data. 



Class percentage for Road_Type:
Road_Type
Single carriageway    68.786420
Dual carriageway      20.268558
Roundabout             9.399544
Slip road              0.810742
One way street         0.734735
Name: count, dtype: float64

Class percentage for Accident_Severity:
Accident_Severity
Slight     88.294908
Serious    10.970357
Fatal       0.734735
Name: count, dtype: float64

Warning: Accident_Severity is imbalanced. The majority class 'Slight' accounts for 88.29490752470231% of the data. 



Though I was able to identify the imbalanced classes, due to limitations of knowledge I was unable to balance these columns without significantly manipulating the data. SMOTE was considered; however, it synthesises new samples in a numerical feature space, so it is not directly applicable to categorical columns. As a workaround, the chosen models take class weights into account, so as to mitigate the problem of imbalanced classes
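
The class-weighting workaround can be sketched with scikit-learn's compute_class_weight, which is the mechanism behind the class_weight='balanced' option used in the models below. The labels here are hypothetical, chosen to mirror the Accident_Severity proportions reported above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical severity labels mirroring the imbalance reported above
y = np.array(['Slight'] * 88 + ['Serious'] * 11 + ['Fatal'] * 1)

classes = np.unique(y)  # sorted alphabetically: Fatal, Serious, Slight
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# Rarer classes receive larger weights: n_samples / (n_classes * class_count)
for cls, w in zip(classes, weights):
    print(f"{cls}: {w:.2f}")
```

With these weights, misclassifying the single 'Fatal' example costs the model far more than misclassifying a 'Slight' one, which is how the imbalance is compensated for without resampling the data.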

In [43]:
# Define ANSI escape codes for red color and bold text
RED = '\033[91m'
BOLD = '\033[1m'
RESET = '\033[0m'
# Columns to exclude
exclude_columns = ['Accident_Index', 'LSOA_of_Accident_Location', 'Time', 'Date']

# Loop through categorical columns except the excluded ones and calculate class percentages
for column in new_df.select_dtypes(include=['object', 'category']).columns:
    if column not in exclude_columns:
        # Get the frequency of each class
        class_distribution = new_df[column].value_counts()
        # Calculate the percentage of each class
        class_percentage = (class_distribution / len(new_df)) * 100
        #threshold to flag the imbalance, 80/20 rule.
        if class_percentage.max() > 80:
            print(f"{RED}{BOLD}Warning: {column} is highly imbalanced.{RESET}\n")
Warning: Weather_Conditions is highly imbalanced.

Warning: Road_Surface_Conditions is highly imbalanced.

Warning: Urban_or_Rural_Area is highly imbalanced.

Warning: Pedestrian_Crossing-Physical_Facilities is highly imbalanced.

Warning: Accident_Severity is highly imbalanced.

3.4. ONE-HOT ENCODING - PREPROCESSING

One-hot encoding is the process whereby categorical variables are converted into a form that can be fed into an ML algorithm (hackernoon.com, n.d.). It typically involves expanding each column into a set of binary features. The code below transforms the required categorical columns into binary features and stores the result as encoded_df. One-hot encoding is typically applied only to categorical types
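
As a minimal illustration of this expansion (using hypothetical weather values and pandas' get_dummies rather than scikit-learn's OneHotEncoder), each distinct category becomes its own 0/1 column:

```python
import pandas as pd

# Hypothetical weather values to illustrate the expansion into binary columns
toy = pd.DataFrame({'Weather': ['Fine', 'Raining', 'Fine', 'Snowing']})

# Each original category becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=['Weather'])
print(encoded.columns.tolist())
```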

In [44]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Define the columns to be one-hot encoded
categorical_cols = ['Weather_Conditions', 'Road_Surface_Conditions', 'Light_Conditions', 
                    'Urban_or_Rural_Area', 'Junction_Detail', 'Pedestrian_Crossing-Physical_Facilities', 
                    'Road_Type']

# Perform one-hot encoding
onehotencoder = OneHotEncoder(sparse_output=False)
encoded_features = onehotencoder.fit_transform(new_df[categorical_cols])
encoded_feature_names = onehotencoder.get_feature_names_out(categorical_cols)
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)

# Reset index to align the original and encoded DataFrames
new_df.reset_index(drop=True, inplace=True)

# Concatenate the one-hot encoded columns with the original DataFrame
new_df_encoded = pd.concat([new_df.drop(categorical_cols, axis=1), encoded_df], axis=1)

3.5. NORMALISATION - PREPROCESSING

The code below normalises the numerical column 'Speed_limit' using min-max scaling

In [45]:
from sklearn.preprocessing import MinMaxScaler

# Normalize numerical columns
scaler = MinMaxScaler()
new_df_encoded['Speed_limit'] = scaler.fit_transform(new_df_encoded[['Speed_limit']])
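
A quick sanity check of min-max scaling on hypothetical speed-limit values: each value x is mapped to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical speed limits; min-max scaling maps them into [0, 1]
speeds = np.array([[20.0], [30.0], [70.0]])

# (x - min) / (max - min): 20 -> 0.0, 30 -> 0.2, 70 -> 1.0
scaled = MinMaxScaler().fit_transform(speeds)
print(scaled.ravel())
```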

3.6. DATA SPLITTING - PREPROCESSING

The dataset is then split into three feature/target sets based on the three business questions identified in section 1

In [46]:
from sklearn.model_selection import train_test_split


# For Question 1: Impact of conditions on accident severity
features_q1 = new_df_encoded.filter(regex='Weather_|Road_|Light_|Speed_limit')
target_q1 = new_df_encoded['Accident_Severity']

# For Question 2: Impact on pedestrian safety and number of casualties
features_q2 = new_df_encoded.filter(regex='Road_Type|Junction_|Pedestrian_')
target_q2 = new_df_encoded['Number_of_Casualties']

# For Question 3: Severity of accidents in different road classes and impact of speed limits
features_q3 = new_df_encoded.filter(regex='Road_Type|Speed_limit|Urban_or_Rural_')
target_q3 = new_df_encoded['Accident_Severity']

# Split the data into training and testing sets for each business question
X_train_q1, X_test_q1, y_train_q1, y_test_q1 = train_test_split(features_q1, target_q1, test_size=0.2, random_state=42)
X_train_q2, X_test_q2, y_train_q2, y_test_q2 = train_test_split(features_q2, target_q2, test_size=0.2, random_state=42)
X_train_q3, X_test_q3, y_train_q3, y_test_q3 = train_test_split(features_q3, target_q3, test_size=0.2, random_state=42)

4.1 MODELLING

MODELLING - Q1

The following code utilises a Random Forest classifier from scikit-learn. It focuses on predicting 'Accident_Severity' based on features related to weather conditions, road surface conditions, and light conditions. The code preprocesses the data, splits it into training and testing sets, and employs a Random Forest Classifier with balanced class weights to account for imbalanced classes. It then outputs the feature importances to assess variable significance. The code also evaluates the model's performance using a classification report.

In [47]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd

# Assuming 'new_df_encoded' is your preprocessed DataFrame ready for modeling
# and 'target_q1' is your target variable for question 1

# Define the features for Question 1
features_q1 = new_df_encoded.filter(regex='Weather_Conditions|Road_Surface_Conditions|Light_Conditions')

# Define the target variable for Question 1
target_q1 = new_df_encoded['Accident_Severity']

# Split the dataset into training and testing sets
X_train_q1, X_test_q1, y_train_q1, y_test_q1 = train_test_split(
    features_q1, 
    target_q1, 
    test_size=0.2, 
    random_state=42
)

# Initialize the Random Forest Classifier with class weights
rf_classifier_q1 = RandomForestClassifier(class_weight='balanced', random_state=42)

# Train the classifier
rf_classifier_q1.fit(X_train_q1, y_train_q1)

# Get the feature importances
feature_importances = rf_classifier_q1.feature_importances_

# Create a DataFrame from the feature importances
feature_importances_df = pd.DataFrame({
    'Feature': features_q1.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False).reset_index(drop=True)

# Display the feature importances DataFrame
display(feature_importances_df)

# Predict on the test set
y_pred_q1 = rf_classifier_q1.predict(X_test_q1)

# Evaluate the predictions
classification_report_q1 = classification_report(y_test_q1, y_pred_q1)

print(classification_report_q1)
Feature Importance
0 Light_Conditions_Daylight 0.153369
1 Light_Conditions_Darkness - lights lit 0.135899
2 Weather_Conditions_Fine no high winds 0.123911
3 Weather_Conditions_Fine + high winds 0.117760
4 Road_Surface_Conditions_Dry 0.110452
5 Road_Surface_Conditions_Wet or damp 0.091499
6 Weather_Conditions_Raining no high winds 0.070967
7 Road_Surface_Conditions_Frost or ice 0.048752
8 Weather_Conditions_Other 0.034811
9 Light_Conditions_Darkness - no lighting 0.025513
10 Weather_Conditions_Snowing no high winds 0.017084
11 Weather_Conditions_Raining + high winds 0.016095
12 Light_Conditions_Darkness - lighting unknown 0.014569
13 Road_Surface_Conditions_Snow 0.012552
14 Light_Conditions_Darkness - lights unlit 0.010615
15 Weather_Conditions_Fog or mist 0.008867
16 Road_Surface_Conditions_Flood over 3cm. deep 0.004821
17 Weather_Conditions_Snowing + high winds 0.002463
              precision    recall  f1-score   support

       Fatal       0.01      0.17      0.01         6
     Serious       0.13      0.08      0.10        83
      Slight       0.89      0.75      0.82       701

    accuracy                           0.68       790
   macro avg       0.35      0.34      0.31       790
weighted avg       0.81      0.68      0.74       790

MODELLING - Q2

The code below also uses a Random Forest classifier, this time to predict 'Number_of_Casualties'. It uses features related to 'Road_Type', 'Junction_Detail', and 'Pedestrian_Crossing-Physical_Facilities' from the preprocessed DataFrame 'new_df_encoded'.

In [48]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd

# Assuming 'new_df_encoded' is your preprocessed DataFrame ready for modeling
# and 'target_q2' is your target variable for question 2

# Define the features for Question 2
features_q2 = new_df_encoded.filter(regex='Road_Type|Junction_Detail|Pedestrian_Crossing-Physical_Facilities')

# Define the target variable for Question 2
target_q2 = new_df_encoded['Number_of_Casualties']

# Split the dataset into training and testing sets
X_train_q2, X_test_q2, y_train_q2, y_test_q2 = train_test_split(
    features_q2, 
    target_q2, 
    test_size=0.2, 
    random_state=42
)

# Initialize the Random Forest Classifier with class weights
rf_classifier_q2 = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    min_samples_split=5,
    class_weight='balanced_subsample',
    random_state=42
)

# Train the classifier
rf_classifier_q2.fit(X_train_q2, y_train_q2)

# Get the feature importances
feature_importances_q2 = rf_classifier_q2.feature_importances_

# Create a DataFrame from the feature importances
feature_importances_df_q2 = pd.DataFrame({
    'Feature': features_q2.columns,
    'Importance': feature_importances_q2
}).sort_values(by='Importance', ascending=False).reset_index(drop=True)

# Display the feature importances DataFrame
display(feature_importances_df_q2)

# Predict on the test set
y_pred_q2 = rf_classifier_q2.predict(X_test_q2)

# Evaluate the predictions
classification_report_q2 = classification_report(y_test_q2, y_pred_q2)

print(classification_report_q2)
Feature Importance
0 Road_Type_Single carriageway 0.075854
1 Road_Type_Dual carriageway 0.072006
2 Junction_Detail_T or staggered junction 0.071935
3 Road_Type_Slip road 0.069871
4 Pedestrian_Crossing-Physical_Facilities_No phy... 0.068963
5 Junction_Detail_Roundabout 0.066526
6 Junction_Detail_Not at junction or within 20 m... 0.064475
7 Junction_Detail_Crossroads 0.063512
8 Pedestrian_Crossing-Physical_Facilities_Pedest... 0.056695
9 Junction_Detail_Other junction 0.048474
10 Pedestrian_Crossing-Physical_Facilities_Pelica... 0.044151
11 Road_Type_Roundabout 0.043795
12 Pedestrian_Crossing-Physical_Facilities_Zebra 0.038393
13 Junction_Detail_Mini-roundabout 0.036860
14 Junction_Detail_More than 4 arms (not roundabout) 0.036147
15 Junction_Detail_Slip road 0.032244
16 Road_Type_One way street 0.031027
17 Pedestrian_Crossing-Physical_Facilities_Centra... 0.029256
18 Junction_Detail_Private drive or entrance 0.025606
19 Pedestrian_Crossing-Physical_Facilities_Footbr... 0.024210
              precision    recall  f1-score   support

           1       0.82      0.72      0.76       639
           2       0.12      0.07      0.08       122
           3       0.06      0.34      0.10        29

    accuracy                           0.60       790
   macro avg       0.33      0.38      0.32       790
weighted avg       0.68      0.60      0.63       790

MODELLING - Q3

The final model uses a Logistic Regression classifier from scikit-learn. It focuses on predicting 'Accident_Severity' using features related to 'Road_Type', 'Speed_limit', and 'Urban_or_Rural_Area' from the preprocessed DataFrame 'new_df_encoded'.

A regression model can also be represented as a network (see below) (Daqing.C, 2023)

image.png

In [49]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd

# Assuming 'new_df_encoded' is your preprocessed DataFrame ready for modeling
# and 'target_q3' is your target variable for question 3

# Define the features for Question 3
features_q3 = new_df_encoded.filter(regex='Road_Type|Speed_limit|Urban_or_Rural')

# Define the target variable for Question 3
target_q3 = new_df_encoded['Accident_Severity']

# Split the dataset into training and testing sets
X_train_q3, X_test_q3, y_train_q3, y_test_q3 = train_test_split(
    features_q3, 
    target_q3, 
    test_size=0.2, 
    random_state=42
)

# Initialize the Logistic Regression model with class weights and regularization
logreg_q3 = LogisticRegression(
    class_weight='balanced',
    max_iter=5000,
    C=0.5,  # Regularization strength
    random_state=42
)

# Train the model
logreg_q3.fit(X_train_q3, y_train_q3)

# Get the model's coefficients
feature_coefficients = logreg_q3.coef_[0]

# Create a DataFrame for the coefficients
coefficients_df = pd.DataFrame({
    'Feature': features_q3.columns,
    'Coefficient': feature_coefficients
}).sort_values(by='Coefficient', ascending=False).reset_index(drop=True)

# Display the coefficients DataFrame
display(coefficients_df)

# Predict on the test set
y_pred_q3 = logreg_q3.predict(X_test_q3)

# Evaluate the predictions
classification_report_q3 = classification_report(y_test_q3, y_pred_q3)

print(classification_report_q3)
Feature Coefficient
0 Road_Type_Slip road 0.640741
1 Urban_or_Rural_Area_Rural 0.411732
2 Road_Type_Roundabout 0.149541
3 Road_Type_Single carriageway 0.076267
4 Speed_limit 0.025558
5 Road_Type_Dual carriageway -0.086287
6 Urban_or_Rural_Area_Urban -0.411883
7 Road_Type_One way street -0.780414
              precision    recall  f1-score   support

       Fatal       0.03      0.67      0.05         6
     Serious       0.11      0.67      0.18        83
      Slight       0.95      0.15      0.26       701

    accuracy                           0.21       790
   macro avg       0.36      0.50      0.16       790
weighted avg       0.85      0.21      0.25       790

4.2. EVALUATION AND CONCLUSION

In [66]:
print("Business Question 1")
print(classification_report_q1)
print("\n" * 4)


print("Business Question 2")
print(classification_report_q2)
print("\n" * 4)


print("Business Question 3")
print(classification_report_q3)
print()
Business Question 1
              precision    recall  f1-score   support

       Fatal       0.01      0.17      0.01         6
     Serious       0.13      0.08      0.10        83
      Slight       0.89      0.75      0.82       701

    accuracy                           0.68       790
   macro avg       0.35      0.34      0.31       790
weighted avg       0.81      0.68      0.74       790






Business Question 2
              precision    recall  f1-score   support

           1       0.82      0.72      0.76       639
           2       0.12      0.07      0.08       122
           3       0.06      0.34      0.10        29

    accuracy                           0.60       790
   macro avg       0.33      0.38      0.32       790
weighted avg       0.68      0.60      0.63       790






Business Question 3
              precision    recall  f1-score   support

       Fatal       0.03      0.67      0.05         6
     Serious       0.11      0.67      0.18        83
      Slight       0.95      0.15      0.26       701

    accuracy                           0.21       790
   macro avg       0.36      0.50      0.16       790
weighted avg       0.85      0.21      0.25       790


Question 1:


Model Used: Random Forest Classifier
Insights: The feature importances from the Random Forest model rank light conditions ('Daylight', 'Darkness - lights lit') and weather conditions ('Fine no high winds', 'Fine + high winds') as the strongest predictors of accident severity, suggesting that lighting and wind conditions contribute to how severe an accident is.

Model Performance Metrics:

  • Accuracy: 0.68 (However, this metric is misleading due to class imbalance)
  • Macro Average:
    • Precision: 0.35
    • Recall: 0.34
    • F1-score: 0.31
  • Weighted Average:
    • Precision: 0.81
    • Recall: 0.68
    • F1-score: 0.74

Class-specific Metrics:

  • Fatal:
    • Precision: 0.01
    • Recall: 0.17
    • F1-score: 0.01
  • Serious:
    • Precision: 0.13
    • Recall: 0.08
    • F1-score: 0.10
  • Slight:
    • Precision: 0.89
    • Recall: 0.75
    • F1-score: 0.82

The model shows a significant imbalance with a tendency to correctly predict 'Slight' severity accidents, while 'Fatal' and 'Serious' severities have very low precision and recall.
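
The point about accuracy being misleading can be demonstrated with a baseline that always predicts the majority class (hypothetical labels mirroring the Accident_Severity split): such a model reaches roughly 88% accuracy while never identifying a single fatal or serious accident.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels with the same imbalance as Accident_Severity
y = np.array(['Slight'] * 88 + ['Serious'] * 11 + ['Fatal'] * 1)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# A baseline that always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)

print(accuracy_score(y, y_pred))           # high despite learning nothing
print(balanced_accuracy_score(y, y_pred))  # exposes the problem
```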

Question 2:


Model Used: Random Forest Classifier (with manually set hyperparameters and balanced subsample class weights)
Insights: The model has likely identified specific road types and pedestrian crossing facilities that correlate with an increased number of casualties. For example, certain types of junctions or crossings without adequate safety measures might see higher casualty numbers.

Model Performance Metrics:

  • Accuracy: 0.60
  • Macro Average:
    • Precision: 0.33
    • Recall: 0.38
    • F1-score: 0.32
  • Weighted Average:
    • Precision: 0.68
    • Recall: 0.60
    • F1-score: 0.63

Class-specific Metrics:

  • 1 casualty:
    • Precision: 0.82
    • Recall: 0.72
    • F1-score: 0.76
  • 2 casualties:
    • Precision: 0.12
    • Recall: 0.07
    • F1-score: 0.08
  • 3 casualties:
    • Precision: 0.06
    • Recall: 0.34
    • F1-score: 0.10

The model performs well for predicting single-casualty incidents but performs poorly for incidents with more casualties.

Question 3:


Model Used: Logistic Regression
Insights: The coefficients from the Logistic Regression model can be interpreted to understand the impact of different road classes and speed limits on the severity of accidents. For instance, the positive coefficients for 'Road_Type_Slip road' and 'Urban_or_Rural_Area_Rural' suggest these settings raise the odds of more severe outcomes, while 'One way street' and urban areas lower them.
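
One way to make the coefficients above more interpretable is to exponentiate them: on the log-odds scale that logistic regression uses, exp(coefficient) acts as an odds ratio (approximate here, since this is a multinomial model), with values above 1 raising the odds and values below 1 lowering them. A sketch using coefficients reported above:

```python
import numpy as np

# Coefficients reported above for the first class (log-odds scale)
coefs = {
    'Road_Type_Slip road': 0.640741,
    'Urban_or_Rural_Area_Rural': 0.411732,
    'Road_Type_One way street': -0.780414,
}

# exp(coefficient) gives an approximate odds ratio:
# values above 1 raise the odds of that class, values below 1 lower them
for name, c in coefs.items():
    print(f"{name}: OR = {np.exp(c):.2f}")
```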

Model Performance Metrics:

  • Accuracy: 0.21
  • Macro Average:
    • Precision: 0.36
    • Recall: 0.50
    • F1-score: 0.16
  • Weighted Average:
    • Precision: 0.85
    • Recall: 0.21
    • F1-score: 0.25

Class-specific Metrics:

  • Fatal:
    • Precision: 0.03
    • Recall: 0.67
    • F1-score: 0.05
  • Serious:
    • Precision: 0.11
    • Recall: 0.67
    • F1-score: 0.18
  • Slight:
    • Precision: 0.95
    • Recall: 0.15
    • F1-score: 0.26

The model has low accuracy: the balanced class weights push it away from the majority class 'Slight', giving high recall for 'Fatal' and 'Serious' but very low precision, indicating a high number of false positives for these classes.

5. Future Research

  • How does the inclusion of temporal data (time of day, day of the week, season) influence the prediction of accident severity and casualty numbers?
  • Are there specific geographic areas that are more prone to severe accidents, and can we identify any spatial patterns?
  • What role could emerging technologies, such as autonomous vehicles or advanced driver-assistance systems (ADAS), play in reducing road accidents and their severity?
  • What is the impact of climate change on road safety?

6. REFERENCES

[1] Romford Recorder. (2023). Petition for lollipop people outside two Havering schools rejected. [online] Available at: https://www.romfordrecorder.co.uk/news/23690203.havering-council-rejects-petition-school-lollipop-people/ [Accessed 4th Nov. 2023].

[2] hackernoon.com. (n.d.). What is One Hot Encoding? Why And When do you have to use it? | Hacker Noon. [online] Available at: https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f.

[3] Daqing, C. (2023). 09 Regression [PowerPoint slides]. Computer Science 4637_2122, London South Bank University, delivered week 9.

[4] Chris, K. (2022). Database Normalization – Normal Forms 1nf 2nf 3nf Table Examples. [online] freeCodeCamp.org. Available at: https://www.freecodecamp.org/news/database-normalization-1nf-2nf-3nf-table-examples/.

[5] Scikit-learn.org. (2014). sklearn.linear_model.LogisticRegression — scikit-learn 0.21.2 documentation. [online] Available at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

[6] Real Python (2019). Linear Regression in Python. [online] Realpython.com. Available at: https://realpython.com/linear-regression-in-python/.

[7] funnel.io. (n.d.). What is data mapping? Data mapping explained (with examples). [online] Available at: https://funnel.io/blog/your-guide-to-data-mapping [Accessed 1st Dec. 2023].

[8] GeeksforGeeks. (2020). Box Plot in Python using Matplotlib. [online] Available at: https://www.geeksforgeeks.org/box-plot-in-python-using-matplotlib/.

[9] www.w3schools.com. (n.d.). Python Machine Learning - K-means. [online] Available at: https://www.w3schools.com/python/python_ml_k-means.asp.